Natural Language Processing and Relational 
Data Extraction Routines in AutoMap 

■ Stemming: Converts words into their morphemes.
■ Reduction and Normalization:
  ▪ Negative filters such as delete lists, removal of symbols
    and formatting, removal of numbers
  ▪ Positive filters such as thesauri, spelling correction,
    relational synonym sets, antonym sets
■ Part of Speech Tagging: Assigns a single best grammar classifier
  or lexical category to every word.
■ Anaphora Resolution: Converts personal pronouns into the entity
  or entities that the pronouns refer to.
■ Named Entity Extraction: Identifies relevant types of information
  that are referred to by a name, such as people, organizations,
  and locations.
■ Ontological Text Coding: Classifies relevant types of information
  according to an ontology or taxonomy. User-defined categorization
  schemata can be applied.
  ▪ Identification of and reasoning about node and edge attributes,
    such as demographic data, beliefs, and types of relationships.
■ Email Data Analysis: Extracts and combines different types of
  networks, such as social networks and knowledge networks, from
  emails.
■ Feature Identification: term weights, TF*IDF
■ Entropy Assessment: Determines the variability or heterogeneity
  of a text document or corpus with respect to its vocabulary.
■ Classical Content Analysis
■ Read and write data and processing material from and to a default
  or user-specified database.

[Figure: AutoMap workflow. Text data -> AutoMap: extract relational
data -> network analysis -> visualization / simulation.]

Illustrative Toy Example:
“Jan Pronk, the Special Representative of Secretary-General Kofi Annan
to Sudan, today called for the immediate return of the vehicles to
World Food […]”

[Figure: proximity-based extraction of relational data from the toy
example, in two panels.
Identification: For relational data with at least one node type:
locate/identify relevant nodes (Jan Pronk, Kofi Annan, Sudan,
vehicles, WFP, NGO's).
Classification: For ontologically coded networks: classify relevant
nodes according to an […] (legend: Knowledge, Person, Organization,
Location, Resource).]
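The proximity-based extraction named in the toy example can be sketched roughly as follows. This is a minimal illustration, not AutoMap's implementation: the function name `proximity_links`, the window size, the pre-tokenized input with underscores joining multi-word names, and the concept set are all assumptions made for the example.

```python
from itertools import combinations

def proximity_links(tokens, concepts, window):
    """Link each pair of distinct concepts that lie fewer than `window` tokens apart."""
    # Keep only tokens identified as relevant nodes, with their positions.
    hits = [(i, t) for i, t in enumerate(tokens) if t in concepts]
    edges = set()
    for (i, a), (j, b) in combinations(hits, 2):
        if a != b and j - i < window:
            edges.add(tuple(sorted((a, b))))  # undirected edge
    return edges

# Hypothetical tokenization of the start of the toy sentence.
tokens = ["Jan_Pronk", "the", "Special", "Representative", "of",
          "Secretary-General", "Kofi_Annan", "to", "Sudan", "today",
          "called", "for", "the", "immediate", "return", "of", "the",
          "vehicles"]
concepts = {"Jan_Pronk", "Kofi_Annan", "Sudan", "vehicles"}

edges = proximity_links(tokens, concepts, window=7)
for edge in sorted(edges):
    print(edge)
```

A larger window links more distant concepts (here, "vehicles" stays unconnected until the window exceeds nine tokens), so the window size directly controls the density of the extracted network.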


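Two of the routines listed above, Entropy Assessment and TF*IDF term weighting, can be illustrated with a short sketch. The source does not say which entropy measure or TF*IDF variant AutoMap uses, so this assumes Shannon entropy over token frequencies and the plain tf * log(N / df) weighting; the function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def vocabulary_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def tf_idf(docs):
    """Return one {term: weight} dict per non-empty document (tf * log(N / df))."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical, already-normalized documents.
docs = [["sudan", "vehicles", "return"], ["sudan", "food", "food"]]
print(vocabulary_entropy(docs[0]), vocabulary_entropy(docs[1]))
print(tf_idf(docs))
```

Under these assumptions, a document that repeats one word has lower entropy than one with evenly spread vocabulary, and a term that occurs in every document receives a weight of zero.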


Development of Computational Solutions 


■ Utilize machinery from Machine Learning and
  Artificial Intelligence
  ▪ Deploy and develop supervised and semi-supervised
    sequential stochastic learning techniques in order to
    train classifiers and build models that generalize to
    new data
  ▪ Construct a classifier h from sequences of pairs (x, y),
    where x = the words of a sequence and y = the
    corresponding categories, by modeling either the joint
    probability P(x, y) or the conditional probability
    P(y|x); h predicts a sequence y = h(x) for any
    sequence x, incl. new and unseen data

■ We work with generative models, P(x, y), such as the
  Hidden Markov Model (HMM), and conditional (aka
  discriminative) models, P(y|x), such as Maximum Entropy
  Markov Models (MEMM) and Conditional Random Fields (CRF)
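To make the classifier y = h(x) concrete for the generative case, here is a minimal sketch of Viterbi decoding for an HMM, which returns the label sequence that maximizes the joint probability P(x, y). The label set, all probabilities, and the `unk` smoothing constant for unseen words are hypothetical values chosen for illustration, not parameters from AutoMap.

```python
def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the label sequence y that maximizes the HMM joint P(x, y)."""
    # best[i][s] = (probability of the best path ending in s at word i, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], unk), None) for s in states}]
    for w in words[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            p, arg = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s].get(w, unk), arg)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy parameters.
states = ["PER", "LOC", "O"]
start_p = {"PER": 0.4, "LOC": 0.1, "O": 0.5}
trans_p = {
    "PER": {"PER": 0.2, "LOC": 0.1, "O": 0.7},
    "LOC": {"PER": 0.1, "LOC": 0.2, "O": 0.7},
    "O":   {"PER": 0.2, "LOC": 0.3, "O": 0.5},
}
emit_p = {
    "PER": {"Pronk": 0.5, "Annan": 0.5},
    "LOC": {"Sudan": 0.9, "Darfur": 0.1},
    "O":   {"visited": 0.5, "called": 0.5},
}

print(viterbi(["Pronk", "visited", "Sudan"], states, start_p, trans_p, emit_p))
```

In practice the start, transition, and emission tables would be estimated from labeled training sequences, and log probabilities would be used to avoid underflow on long inputs.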





Example: Conditional Random Fields for Entity Extraction and 
Ontological Text Coding 


■ Identify and classify words that represent instances of entity classes of models or
  ontologies that deviate from the classical set of Named Entities.

■ Crucial step for coding texts as socio-technical networks according to domain-
  specific ontologies and for advanced modeling of complex and dynamic real-world
  organizations or networks.

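■ Model the relationship between y_{i-1} and y_i as a Markov Random Field conditioned on x

■ Conditional distribution of the entity sequence y given the observation sequence x,
  computed as a normalized product of potential functions M_i (with y_0 = start and
  y_{n+1} = stop):

\[
P(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}
                   {\left( \prod_{i=1}^{n+1} M_i(x) \right)_{\mathrm{start},\,\mathrm{stop}}}
\]

\[
M_i(y_{i-1}, y_i \mid x) = \exp\Big( \sum_k \lambda_k f_k(y_{i-1}, y_i, x) + \sum_k \mu_k g_k(y_i, x) \Big)
\]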


■ Conditional probability of the label sequence, P(y|x), where both x and y are arbitrarily
  long vectors; the model can consider an arbitrarily large bag of features (> 10,000) and
  any property of x, such as long-distance information
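The normalized product of potential functions described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the label set, the feature functions, and their weights in `score` are invented for the example, and the code indexes positions from 0 where the notation above counts from 1. The normalizer is read off the (start, stop) entry of the product of the position-wise matrices.

```python
import math
from functools import reduce
from itertools import product

LABELS = ["start", "PER", "O", "stop"]  # hypothetical label set
START, STOP = 0, 3

def score(prev, cur, words, i):
    """Hypothetical weighted features: edge terms (f_k) plus node terms (g_k)."""
    s = 0.0
    if i < len(words):
        capitalized = words[i][0].isupper()
        if cur == "PER" and capitalized:
            s += 2.0  # node feature: capitalized word labeled PER
        if cur == "O" and not capitalized:
            s += 1.0  # node feature: lowercase word labeled O
    if prev == "PER" and cur == "PER":
        s += 0.5      # edge feature: PER follows PER
    return s

def potentials(words):
    """Build the n+1 matrices; entry [a][b] is the potential for moving from label a to b."""
    n = len(words)
    mats = []
    for i in range(n + 1):
        M = [[0.0] * len(LABELS) for _ in LABELS]
        for a, prev in enumerate(LABELS):
            for b, cur in enumerate(LABELS):
                if a == STOP or b == START:
                    continue  # nothing leaves stop or enters start
                if (a == START) != (i == 0) or (b == STOP) != (i == n):
                    continue  # start only at the first step, stop only at the last
                M[a][b] = math.exp(score(prev, cur, words, i))
        mats.append(M)
    return mats

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def sequence_probability(words, labels):
    """P(y | x): product of potentials along y, divided by the (start, stop) normalizer."""
    mats = potentials(words)
    path = [START] + [LABELS.index(l) for l in labels] + [STOP]
    numerator = math.prod(M[a][b] for M, a, b in zip(mats, path, path[1:]))
    z = reduce(matmul, mats)[START][STOP]
    return numerator / z

words = ["Jan", "Pronk", "called"]
probs = {ys: sequence_probability(words, ys)
         for ys in product(["PER", "O"], repeat=len(words))}
best = max(probs, key=probs.get)
print(best, probs[best])
```

Because the normalizer sums over every label path from start to stop, the probabilities of all candidate label sequences add up to one. Real feature sets would be far larger and the weights would be learned from training data rather than set by hand.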